2024-11-10
The New York City Taxi and Limousine Commission (TLC), created in 1971, is the agency responsible for licensing and regulating New York City’s medallion (yellow) taxis, street hail livery (green) taxis, for-hire vehicles (FHVs), commuter vans, and paratransit vehicles. The TLC cooperates with taxi technology providers (now called technology service providers, or TSPs) to collect trip record information for each taxi and FHV trip completed by licensed drivers and vehicles.
Taxi trip data can be downloaded from the TLC website (https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page), where trip records are published by year, month, and vehicle type (yellow/green/FHV/High-volume FHV). Of the four vehicle types, we first narrow our target to yellow and green taxis: they are the “traditional” taxi types that respond to street hails, and their records come from a more reliable data collection process than FHV trip records, which rely on submissions from companies such as Uber and Lyft.
Comparing the two taxi types, the usage of green (“boro”) taxis is very limited relative to yellow taxis. This is mainly due to their specific purpose of serving the outer boroughs: green taxis may not pick up new passengers within the “yellow zone” of Manhattan or at the airports. This has led to an 86% plunge in the number of operating green cabs, from 6,500 in 2015 to fewer than 900 in 2023.
Green taxis therefore do not fit our purpose of understanding taxi demand patterns across all of New York City, and their usage is very limited compared to yellow taxis. The dataset used for this analysis will thus consist of yellow taxi trip records only.
Reference (decline of green taxi population): (https://www.nbcnewyork.com/news/local/green-cabs-are-being-phased-out-heres-what-will-replace-them/4302496/#:~:text=The%20Taxi%20and%20Limousine%20Commission,%25%20plunge%2C%20The%20City%20reported.)
## # A tibble: 6 × 18
## vendor_name Trip_Pickup_DateTime Trip_Dropoff_DateTime Passenger_Count
## <chr> <chr> <chr> <int>
## 1 VTS 2009-01-04 02:52:00 2009-01-04 03:02:00 1
## 2 VTS 2009-01-04 03:31:00 2009-01-04 03:38:00 3
## 3 VTS 2009-01-03 15:43:00 2009-01-03 15:57:00 5
## 4 DDS 2009-01-01 20:52:58 2009-01-01 21:14:00 1
## 5 DDS 2009-01-24 16:18:23 2009-01-24 16:24:56 1
## 6 DDS 2009-01-16 22:35:59 2009-01-16 22:43:35 2
## # ℹ 14 more variables: Trip_Distance <dbl>, Start_Lon <dbl>, Start_Lat <dbl>,
## # Rate_Code <dbl>, store_and_forward <dbl>, End_Lon <dbl>, End_Lat <dbl>,
## # Payment_Type <chr>, Fare_Amt <dbl>, surcharge <dbl>, mta_tax <dbl>,
## # Tip_Amt <dbl>, Tolls_Amt <dbl>, Total_Amt <dbl>
The raw dataset provided by the TLC consists of 18 columns with largely self-explanatory names. Our key variables are:
- Trip_Pickup_DateTime and Trip_Dropoff_DateTime, representing temporal information;
- the Lon/Lat columns (Start_Lon, Start_Lat, End_Lon, End_Lat), representing spatial information.
The spatial portion of the data, however, has changed substantially. The TLC no longer provides the coordinates of pickup/dropoff locations; they have been replaced by location IDs that indicate which “taxi zone” each location falls into. While exact lon/lat values would better serve our purpose, only the records from 2009 and 2010 are available in that format. Since the scope of this analysis is recent trends in taxi demand, we have decided to rely on the taxi zone shapefile, also provided by the TLC, to capture the spatial nature of the taxi data.
We will use the yellow taxi trip records of August 2024, the most recent data published by the TLC, with a sample size of 1 million trips.
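library(ggplot2)    # plotting
library(ggspatial)  # annotation_map_tile()
library(sf)         # simple features support for geom_sf()

# zoneshp: the TLC taxi zone shapefile, read in as an sf object
ggplot() +
  annotation_map_tile(type = "osm", zoom = 12) +  # OpenStreetMap basemap
  geom_sf(data = zoneshp) +
  labs(title = "NYC Taxi Zone Boundaries",
       x = "Longitude", y = "Latitude") +
  theme_minimal()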
## Zoom: 12
The TLC has divided New York City into 263 taxi zones, which are referenced by the “PULocationID” and “DOLocationID” columns in the dataset. To give each row coordinates for spatial analysis, we calculate the centroid of each taxi zone and add the centroid coordinates to the original dataset:
## PULocationID DOLocationID PU_Longitude PU_Latitude DO_Longitude
## 308175 237 161 -73.96563 40.76862 -73.97770
## 338694 100 186 -73.98879 40.75351 -73.99244
## 54621 161 114 -73.97770 40.75803 -73.99738
## 36044 100 13 -73.98879 40.75351 -74.01608
## 802089 75 75 -73.94575 40.79001 -73.94575
## 77387 163 162 -73.97757 40.76442 -73.97236
## DO_Latitude
## 308175 40.75803
## 338694 40.74850
## 54621 40.72834
## 36044 40.71204
## 802089 40.79001
## 77387 40.75669
The six columns shown above provide spatial information; the last four were calculated from the first two, which come from the original data.
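The centroid step described above could be sketched roughly as follows. This is a minimal illustration, not the code used for the report: the two square "zones", the `unit_square()` helper, and the three toy trips are hypothetical stand-ins for the TLC shapefile and trip records. Only the column names are taken from the output above.

```r
library(sf)

# Hypothetical helper: a unit square with lower-left corner (x0, y0)
unit_square <- function(x0, y0) {
  st_polygon(list(rbind(
    c(x0, y0), c(x0 + 1, y0), c(x0 + 1, y0 + 1), c(x0, y0 + 1), c(x0, y0)
  )))
}

# Toy stand-in for the taxi zone shapefile (two zones, no CRS)
zones <- st_sf(
  LocationID = c(1L, 2L),
  geometry = st_sfc(unit_square(0, 0), unit_square(2, 0))
)

# One centroid per zone
centroids <- st_centroid(st_geometry(zones))
# With the real shapefile (projected CRS, in feet), convert to lon/lat first:
# centroids <- st_transform(centroids, 4326)
xy <- st_coordinates(centroids)  # matrix with columns X and Y

# Toy trips; look up each trip's pickup and dropoff zone by LocationID
trips <- data.frame(PULocationID = c(2L, 1L, 2L), DOLocationID = c(1L, 1L, 2L))
pu <- match(trips$PULocationID, zones$LocationID)
do <- match(trips$DOLocationID, zones$LocationID)

trips$PU_Longitude <- xy[pu, "X"]
trips$PU_Latitude  <- xy[pu, "Y"]
trips$DO_Longitude <- xy[do, "X"]
trips$DO_Latitude  <- xy[do, "Y"]
```

Note that `match()` returns the first matching row, which matters if a LocationID appears more than once in the shapefile.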
## tpep_pickup_datetime tpep_dropoff_datetime Duration
## 1 2024-08-10 16:40:15 2024-08-10 16:58:01 17.766667
## 2 2024-08-01 13:21:49 2024-08-01 13:37:04 15.250000
## 3 2024-08-29 12:22:11 2024-08-29 12:30:48 8.616667
## 4 2024-08-16 14:53:22 2024-08-16 14:57:06 3.733333
## 5 2024-08-10 02:00:23 2024-08-10 02:22:08 21.750000
## 6 2024-08-14 08:27:33 2024-08-14 08:42:31 14.966667
Another parameter, ‘Duration’, was added by calculating the gap in minutes between pickup time and dropoff time for each row. Data points with a missing pickup and/or dropoff time, or with a negative ‘Duration’ value, were removed in the process.
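A minimal base R sketch of this step is shown below, assuming the timestamps are stored as character strings. The four toy rows are illustrative (the first two reuse timestamps from the output above), and parsing in UTC is a simplifying assumption made here to sidestep daylight-saving edge cases.

```r
# Toy data: row 3 has a missing pickup time, row 4 has dropoff before pickup
trips <- data.frame(
  tpep_pickup_datetime  = c("2024-08-10 16:40:15", "2024-08-01 13:21:49",
                            NA, "2024-08-16 14:57:06"),
  tpep_dropoff_datetime = c("2024-08-10 16:58:01", "2024-08-01 13:37:04",
                            "2024-08-29 12:30:48", "2024-08-16 14:53:22")
)

fmt <- "%Y-%m-%d %H:%M:%S"
pickup  <- as.POSIXct(trips$tpep_pickup_datetime,  format = fmt, tz = "UTC")
dropoff <- as.POSIXct(trips$tpep_dropoff_datetime, format = fmt, tz = "UTC")

# Trip duration in minutes; NA when either timestamp is missing
trips$Duration <- as.numeric(difftime(dropoff, pickup, units = "mins"))

# Drop rows with missing timestamps or a negative duration
trips <- trips[!is.na(trips$Duration) & trips$Duration >= 0, ]
```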
trip_distance, the distance traveled on each trip in miles, also played a crucial role in the data cleaning process, since many data points had abnormal values such as zero or extremely large numbers. These data points were treated as outliers and removed, leaving 821,022 rows.
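The exact outlier rule is not stated here, so the following is only one possible way to implement such a filter. It assumes an upper cutoff at the third quartile plus 1.5 times the interquartile range (Tukey's fence), which is an assumption of this sketch and not necessarily the rule used in the analysis. The distances are made-up values.

```r
# Toy distances in miles, including a zero and an implausibly large value
trip_distance <- c(0, 0.8, 1.2, 1.9, 2.5, 3.1, 4.0, 250)

# Assumed rule: upper cutoff = Q3 + 1.5 * IQR
q <- quantile(trip_distance, c(0.25, 0.75), names = FALSE)
upper <- q[2] + 1.5 * (q[2] - q[1])

# Keep strictly positive distances that do not exceed the cutoff
keep <- trip_distance > 0 & trip_distance <= upper
cleaned <- trip_distance[keep]
```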
ggplot()+
annotation_map_tile(type = "osm", zoom = 12) +
geom_sf(data = zoneshp, aes(fill = borough))
## Zoom: 12
The taxi zone shapefile also provides useful information about the TLC’s taxi zone system. Based on the New York City subdivision (‘borough’) information in the data, we can potentially conduct borough-based analysis as well. It is also notable that the legend includes ‘EWR’ alongside the five boroughs of NYC, with only one taxi zone, ‘Newark Airport’, in this category. Newark Liberty International Airport is located in New Jersey, and since October 2022 it is no longer grouped under the NYC city code for airline purposes; it nevertheless serves a notable volume of NYC residents’ travel because of its proximity to certain parts of the city. The TLC has seemingly acknowledged this history and kept EWR as a within-NYC taxi zone.
The total count of pickups and dropoffs in each zone is a simple yet effective way of understanding patterns of taxi demand.
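One way to obtain such counts is to tabulate trips by zone ID and left-join the result onto the zone table, as in the base R sketch below. The toy data frames are hypothetical; only the `PULocationID`, `LocationID`, and `PUCounts` names come from the report.

```r
# Toy trips and zone table; zone 103 has no pickups
trips <- data.frame(PULocationID = c(161L, 237L, 161L, 100L, 161L))
zones <- data.frame(LocationID = c(100L, 103L, 161L, 237L))

# Count pickups per zone ID
pu <- as.data.frame(table(LocationID = trips$PULocationID),
                    responseName = "PUCounts", stringsAsFactors = FALSE)
pu$LocationID <- as.integer(pu$LocationID)

# Left join: every zone is kept, and zones without pickups get NA
zones_2 <- merge(zones, pu, by = "LocationID", all.x = TRUE)
```

Because of the left join, a zone with no recorded pickups ends up with `NA` and not zero in the count column.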
ggplot()+
annotation_map_tile(type = "osm", zoom = 12) +
geom_sf(data = zoneshp_2, aes(fill = PUCounts))+
scale_fill_distiller(palette = "Spectral")+
labs(title = "NYC Yellow Cab Pickup Counts by Taxi Zone",
x = "Longitude", y = "Latitude") +
theme_minimal()
## Zoom: 12
We can further retrieve the top 10 zones by pickup count for more detail:
top_n(zoneshp_2, 10, PUCounts)[,5:7]
## Simple feature collection with 10 features and 3 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: 981976.6 ymin: 206851.1 xmax: 998281.4 ymax: 226338.3
## Projected CRS: NAD83 / New York Long Island (ftUS)
## zone borough PUCounts
## 1 Lincoln Square East Manhattan 25633
## 2 Midtown Center Manhattan 43679
## 3 Midtown East Manhattan 32666
## 4 Murray Hill Manhattan 27706
## 5 Penn Station/Madison Sq West Manhattan 32872
## 6 Times Sq/Theatre District Manhattan 29769
## 7 Union Sq Manhattan 25291
## 8 Upper East Side North Manhattan 32212
## 9 Upper East Side South Manhattan 38639
## 10 East Chelsea Manhattan 25916
## geometry
## 1 MULTIPOLYGON (((989380.3 21...
## 2 MULTIPOLYGON (((991081 2144...
## 3 MULTIPOLYGON (((992224.4 21...
## 4 MULTIPOLYGON (((991999.3 21...
## 5 MULTIPOLYGON (((986752.6 21...
## 6 MULTIPOLYGON (((988786.9 21...
## 7 MULTIPOLYGON (((987029.8 20...
## 8 MULTIPOLYGON (((995940 2211...
## 9 MULTIPOLYGON (((993633.4 21...
## 10 MULTIPOLYGON (((983690.4 20...
Manhattan, and the Midtown area in particular, dominates the pickup counts.
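The table above is not sorted by count. As a small illustration, the same ten zones can be ranked in descending order with base R; the counts below are copied from the output above, and the data frame name `top10` is introduced here only for the example.

```r
# Pickup counts for the ten zones listed above
top10 <- data.frame(
  zone = c("Lincoln Square East", "Midtown Center", "Midtown East",
           "Murray Hill", "Penn Station/Madison Sq West",
           "Times Sq/Theatre District", "Union Sq",
           "Upper East Side North", "Upper East Side South", "East Chelsea"),
  PUCounts = c(25633, 43679, 32666, 27706, 32872,
               29769, 25291, 32212, 38639, 25916)
)

# Sort from busiest to least busy
ranked <- top10[order(top10$PUCounts, decreasing = TRUE), ]
rownames(ranked) <- NULL
head(ranked, 3)
```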
zoneshp_2 %>% filter(is.na(PUCounts))
## Simple feature collection with 20 features and 7 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: 913175.1 ymin: 120121.9 xmax: 1049020 ymax: 230302.8
## Projected CRS: NAD83 / New York Long Island (ftUS)
## First 10 features:
## LocationID OBJECTID Shape_Leng Shape_Area
## 1 103 104 0.02122083 1.192053e-05
## 2 103 105 0.07742534 3.686364e-04
## 3 103 103 0.01430552 6.330564e-06
## 4 109 109 0.17826782 1.169601e-03
## 5 110 110 0.10394629 5.257451e-04
## 6 111 111 0.05993088 2.086833e-04
## 7 118 118 0.24396622 1.826939e-03
## 8 156 156 0.14447689 1.052122e-03
## 9 172 172 0.11847612 6.584025e-04
## 10 187 187 0.12686843 4.211958e-04
## zone borough PUCounts
## 1 Governor's Island/Ellis Island/Liberty Island Manhattan NA
## 2 Governor's Island/Ellis Island/Liberty Island Manhattan NA
## 3 Governor's Island/Ellis Island/Liberty Island Manhattan NA
## 4 Great Kills Staten Island NA
## 5 Great Kills Park Staten Island NA
## 6 Green-Wood Cemetery Brooklyn NA
## 7 Heartland Village/Todt Hill Staten Island NA
## 8 Mariners Harbor Staten Island NA
## 9 New Dorp/Midland Beach Staten Island NA
## 10 Port Richmond Staten Island NA
## geometry
## 1 MULTIPOLYGON (((973172.7 19...
## 2 MULTIPOLYGON (((979605.8 19...
## 3 MULTIPOLYGON (((972079.6 19...
## 4 MULTIPOLYGON (((943392.6 14...
## 5 MULTIPOLYGON (((951420.1 13...
## 6 MULTIPOLYGON (((985590.4 17...
## 7 MULTIPOLYGON (((954167.8 16...
## 8 MULTIPOLYGON (((934327.5 17...
## 9 MULTIPOLYGON (((960204.8 14...
## 10 MULTIPOLYGON (((946964.1 17...
Several taxi zones are drawn in gray: these have NA in the ‘PUCounts’ column because no pickups from them were recorded in this particular dataset. Note that ‘Governor’s Island/Ellis Island/Liberty Island’ (rows 1, 2, and 3) can be expected to always have zero pickups, since these areas can only be reached by ferry.
ggplot()+
annotation_map_tile(type = "osm", zoom = 12) +
geom_sf(data = zoneshp_2, aes(fill = DOCounts))+
scale_fill_distiller(palette = "Spectral")+
labs(title = "NYC Yellow Cab Dropoff Counts",
x = "Longitude", y = "Latitude") +
theme_minimal()
## Zoom: 12
## Simple feature collection with 10 features and 4 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: 981976.6 ymin: 208788.5 xmax: 998281.4 ymax: 226338.3
## Projected CRS: NAD83 / New York Long Island (ftUS)
## zone borough PUCounts DOCounts
## 1 Lenox Hill West Manhattan 20507 22965
## 2 Lincoln Square East Manhattan 25633 22808
## 3 Midtown Center Manhattan 43679 34938
## 4 Midtown East Manhattan 32666 27157
## 5 Murray Hill Manhattan 27706 28809
## 6 Times Sq/Theatre District Manhattan 29769 28543
## 7 Upper East Side North Manhattan 32212 33258
## 8 Upper East Side South Manhattan 38639 34043
## 9 Clinton East Manhattan 23813 23028
## 10 East Chelsea Manhattan 25916 24247
## geometry
## 1 MULTIPOLYGON (((994839.1 21...
## 2 MULTIPOLYGON (((989380.3 21...
## 3 MULTIPOLYGON (((991081 2144...
## 4 MULTIPOLYGON (((992224.4 21...
## 5 MULTIPOLYGON (((991999.3 21...
## 6 MULTIPOLYGON (((988786.9 21...
## 7 MULTIPOLYGON (((995940 2211...
## 8 MULTIPOLYGON (((993633.4 21...
## 9 MULTIPOLYGON (((986694.3 21...
## 10 MULTIPOLYGON (((983690.4 20...
Dropoffs are also heavily concentrated in central Manhattan.
zoneshp_2 %>% filter(is.na(DOCounts))
## Simple feature collection with 19 features and 8 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: 913175.1 ymin: 120121.9 xmax: 1049020 ymax: 246613
## Projected CRS: NAD83 / New York Long Island (ftUS)
## First 10 features:
## LocationID OBJECTID Shape_Leng Shape_Area
## 1 103 104 0.02122083 1.192053e-05
## 2 103 103 0.01430552 6.330564e-06
## 3 103 105 0.07742534 3.686364e-04
## 4 109 109 0.17826782 1.169601e-03
## 5 110 110 0.10394629 5.257451e-04
## 6 118 118 0.24396622 1.826939e-03
## 7 156 156 0.14447689 1.052122e-03
## 8 176 176 0.15199519 6.577821e-04
## 9 187 187 0.12686843 4.211958e-04
## 10 199 199 0.07780850 2.887475e-04
## zone borough PUCounts
## 1 Governor's Island/Ellis Island/Liberty Island Manhattan NA
## 2 Governor's Island/Ellis Island/Liberty Island Manhattan NA
## 3 Governor's Island/Ellis Island/Liberty Island Manhattan NA
## 4 Great Kills Staten Island NA
## 5 Great Kills Park Staten Island NA
## 6 Heartland Village/Todt Hill Staten Island NA
## 7 Mariners Harbor Staten Island NA
## 8 Oakwood Staten Island 1
## 9 Port Richmond Staten Island NA
## 10 Rikers Island Bronx NA
## DOCounts geometry
## 1 NA MULTIPOLYGON (((973172.7 19...
## 2 NA MULTIPOLYGON (((972079.6 19...
## 3 NA MULTIPOLYGON (((979605.8 19...
## 4 NA MULTIPOLYGON (((943392.6 14...
## 5 NA MULTIPOLYGON (((951420.1 13...
## 6 NA MULTIPOLYGON (((954167.8 16...
## 7 NA MULTIPOLYGON (((934327.5 17...
## 8 NA MULTIPOLYGON (((950393.9 14...
## 9 NA MULTIPOLYGON (((946964.1 17...
## 10 NA MULTIPOLYGON (((1015024 230...
Aside from the three islands mentioned above, there are also zones with no dropoff records, or with no pickup or dropoff records at all.
This part was not included in the final report; while the ‘Duration’ variable was crucial in the data cleaning process, the team decided that further analysis of this variable was irrelevant to the goal of this research.
summary(data$Duration)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.03333 7.01667 11.01667 12.35400 16.35000 40.23333
ggplot(data, aes(x= "Trips", y=Duration))+
geom_boxplot(notch=TRUE, fill = "yellow", alpha = 0.3, size = 1.2)+
labs(title = "Boxplot of Yellow Taxi Trip Durations")+
theme_minimal()+
theme(plot.title = element_text(hjust = 0.5))
#version 1
ggplot(data, aes(x=Duration))+
geom_histogram(color="darkblue", fill="lightblue")+
labs(title = "Distribution of Yellow Taxi Trip Duration")+
theme_minimal()+
theme(plot.title = element_text(hjust = 0.5))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
From the summary and plots, we can observe that yellow taxi rides
last about 12 minutes on average (median about 11 minutes), with half of
all rides lasting between roughly 7 and 16 minutes and fewer than a
quarter lasting longer than about 16 minutes.
ggplot(data, aes(x=as.Date(tpep_pickup_datetime)))+
geom_bar(color="darkred", fill="red", alpha = 0.3)+
labs(title = "Daily Yellow Taxi Pickup Counts of August 2024")+
theme_minimal()+
theme(plot.title = element_text(hjust = 0.5))
ggplot(data, aes(x=as.Date(tpep_pickup_datetime)))+
geom_histogram(color="darkred", fill="red", alpha = 0.3)+
labs(title = "Daily Yellow Taxi Pickup Counts of August 2024")+
theme_minimal()+
theme(plot.title = element_text(hjust = 0.5))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The bar plot and the histogram of the same data look different. One likely contributor is that geom_histogram() defaults to 30 bins (see the message above), which do not line up with calendar days; the difference may also indicate that the ‘tpep_pickup_datetime’ variable from the original data is not in a proper date & time format.
Since our analysis of taxi demand considers both space and time, the time-related variables in the dataset need further processing. An overview of pickup counts by time period, mainly hour of day and day of week, would be necessary to better reflect the goal of this project.
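A quick way to rule out the format issue is to check the column's class and parse it explicitly before counting by day. The sketch below uses three made-up timestamps and assumes they are local New York times stored as character strings.

```r
# Toy pickup timestamps stored as text
pickup_raw <- c("2024-08-01 13:21:49", "2024-08-01 23:59:59", "2024-08-02 00:00:01")
class(pickup_raw)  # "character": not yet a date-time type

# Parse explicitly; the time zone is an assumption of this sketch
pickup <- as.POSIXct(pickup_raw, format = "%Y-%m-%d %H:%M:%S",
                     tz = "America/New_York")

# Count pickups per calendar day in the object's own time zone
daily <- table(format(pickup, "%Y-%m-%d"))
```

In the plotting code, setting `binwidth = 1` on a Date axis would give one histogram bin per day, which is closer to what the bar plot shows.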
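As a rough sketch of that next step, hour-of-day and day-of-week counts can be derived from parsed timestamps with `format()`. The three timestamps and the time zone below are illustrative assumptions, not values from the dataset.

```r
# Toy parsed pickup times (local New York time is assumed)
pickup <- as.POSIXct(
  c("2024-08-01 08:15:00", "2024-08-01 08:45:00", "2024-08-03 23:10:00"),
  tz = "America/New_York"
)

# Hour of day (0-23) and ISO weekday (1 = Monday ... 7 = Sunday)
pickup_hour <- as.integer(format(pickup, "%H"))
pickup_wday <- format(pickup, "%u")

# Counts per hour and per weekday; factor levels keep empty slots as zero
hour_counts <- table(factor(pickup_hour, levels = 0:23))
wday_counts <- table(factor(pickup_wday, levels = as.character(1:7)))
```

Using `%u` instead of weekday names keeps the result independent of the session locale.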